and multiplication in a parallel and fault tolerant manner. Because of this, they hold 
significant promise for use in digital signal processing, where high speed arithmetic 
operators are needed. However, the difficulties in using RNS, such as magnitude 
comparison between two RNS values, division, and determining overflow or underflow
out of system range, have prevented more widespread use of these systems. This
thesis investigates traditional methods of performing comparisons and proposes some
new ones. Proposals include residue number system with quotient (RNS-Q), residue
number system quotient-on-demand (RNS-QD), and pipelined conversions from traditional
RNS to a mixed radix representation. These proposals are compared with
traditional methods with respect to the silicon area needed for implementation, the speed
with which they can be developed, and the VLSI techniques used to carry out the
design.
I. INTRODUCTION 



A. BACKGROUND 

Residue Number Systems (RNS) have been investigated for quite some time for 
use in computer arithmetic implementations. There have been many stumbling blocks 
that have limited, or at times prevented, them from becoming more commonplace in 
computer systems. Frequently encountered difficulties are division, sign determina- 
tion, detecting underflow and overflow, and comparing two RNS values. The primary 
focus of this thesis is the development of more efficient methods of comparing two 
residue numbers. 

RNS representations lend themselves best to applications that require frequent 
addition, subtraction, and/or multiplication. These operations take advantage of the
carry-free and parallel nature of residue arithmetic. Hence, they are well suited
to signal processing techniques [Ref. 1]. They have also been shown to be of potential
value in solving linear equations that are ill-conditioned [Ref. 2]. Researchers have 
done substantial work in the areas of optimizing arithmetic manipulations performed 
by RNS, but the basic means of comparison has involved a form of table look-up 
for a conventional weighted number value [Ref. 3, 4, 5]. Table look-ups tend to 
require large amounts of silicon area and are not very efficient in terms of speed of 
conversion, thereby making them the primary bottleneck preventing more widespread 
use of RNS. 

This chapter includes the essential material needed for a basic understanding
of modular/residue mathematics. Section C.6 of this chapter provides some basic
VLSI design considerations and tradeoffs. It is not intended to be comprehensive,
but should help one to understand the fundamentals.

B. HISTORY 

Residue, or modulo, number systems have been known since approximately
100 A.D. Their discovery has been jointly credited to China’s Sun Tzu and the Greek
mathematician Nicomachus [Ref. 2]. Credit is most commonly given to
Sun Tzu for a verse he wrote describing a three-modulus RNS with prime moduli 3,
5, and 7, repeated here:

We have things of which we do not know the number,
If we count them by threes, the remainder is 2,
If we count them by fives, the remainder is 3,
If we count them by sevens, the remainder is 2,
How many things are there?
The answer, 23.

The rule stated in the verse has come to be known as the Chinese Remainder Theorem 
(CRT). 

During the Ming Dynasty (1368 AD - 1643 AD) Hsin Tai-Wei may have published
the first proof of the CRT in a verse entitled “Hun Hsin Tiang Bing” (Counting
Soldiers). Hsin Tai-Wei’s verse is as follows:

Three men walk together, their chance of reaching seventy so slight. 

Among the five plum trees, twenty-one blossoms did they yield. 

Seven sons at midmonth, happily did reunite. 

Divide the sum by 105, the answer is revealed. 
This verse reiterates the modulo system with relatively prime moduli {3, 5, 7},
which possesses a dynamic range of 105 (3 * 5 * 7). Euler is credited with the first rigorous
mathematical proof of the theorem, published in 1734. Gauss also published this theorem and
the overall theory of residue numbers in the nineteenth century in his Disquisitiones
Arithmeticae [Ref. 1].

Although the bulk of the theoretical development of residue number systems had
been completed by the end of the nineteenth century, little practical
use had been found for them. With the advent of the electronic computer, new interest
developed in RNS methods. There was a flurry of activity after World War II, primarily
focused on the error detecting capabilities of the system as a way to make vacuum tube
computers more reliable. Fault tolerance and error detection/correction research became
less important when the transistor was invented, which dramatically increased
computer reliability. Increased activity in the RNS field was noted in the 1950’s and 1960’s,
as attempts were made to use RNS in general-purpose computing machines. The
difficulties encountered in handling sign detection, division, and comparison made
RNS implementation undesirable in these machines.

Digital signal processing began to emerge as a significant, distinct field of research
in the 1960’s, separate from general computing machines. Cheney designed a
digital correlator that was based entirely on residue arithmetic in 1961 [Ref. 6].
Unfortunately, this development did not receive much attention and there was again
a lapse in the intensity of research conducted on residue arithmetic. VLSI (very large
scale integration) rapidly accelerated electronic development in the 1970’s. New VLSI
tools created new techniques for system design and gave rise to new problems for DSP
researchers. Traditional methods employed for digital signal processing were neither very
modular nor parallel in nature. Modularity and parallelism are two key issues when
considering a VLSI design implementation (discussed in a later section); these two
issues brought RNS research back to the surface in an effort to take advantage
of the modular and parallel characteristics that are inherent to residue arithmetic.
VLSI issues bring us to the present day, where we are still in need of more efficient
methods for comparing two residue numbers.

Throughout the history of residue arithmetic development there have been many
periods of inactivity. Periods of disuse have been driven by the development of techniques
that were faster than what could be implemented using RNS methods, and by the lack of
fully efficient implementations of modulo systems. During the eighteenth and
nineteenth centuries some of the giants of mathematics, including Euler and Gauss,
did extensive work on theoretical development. Modern researchers have often
stated that we may be rediscovering facts about RNS that have been lost in time
[Ref. 1]. Knuth states, “Perhaps some day highly parallel computers will make
simultaneous operations commonplace, so that modular arithmetic will be of significant
importance in ‘real-time’ calculations when a quick answer to a single problem
requiring high precision is needed.” [Ref. 2] Parallel computing is becoming
more and more commonplace. There is a need for more rapid calculations,
especially in the field of digital signal processing, that can utilize the advantages of
RNS methods while suffering very little from any of the disadvantages.

C. BASIC CONCEPTS 
1. RNS 

Residue Number Systems are formed by selecting several relatively prime
moduli. Relatively prime means that no two of the moduli have any common
factor other than unity; i.e., two distinct moduli $m_j$ and $m_k$ are relatively prime
if and only if

$$\gcd(m_j, m_k) = 1. \tag{1.1}$$
There are a variety of methods available to select relatively prime moduli [Ref. 2].
We are not forced to look for special distinct primes of the Mersenne type
or anything else that exotic [Ref. 7]. One easy way to derive a set of three relatively
prime moduli is to use $2^n - 1$, $2^n$, and $2^n + 1$ as the set. Common sense can
also be a big help in determining whether the moduli in a chosen set are relatively prime.

Another important fact about Residue Number Systems is that their residues form
finite rings and, for a prime modulus, finite or Galois fields. This is significant in that it establishes their inherent error
detecting capabilities. The set $S_m = \{0, 1, \ldots, m - 1\}$ together with modulo $m$
addition and multiplication forms a finite algebra, denoted $\{S_m, +, *\}$. If $m$ is prime,
then $\{S_m, +, *\}$ forms a finite or Galois field. However, if $m$ is not prime, then
$\{S_m, +, *\}$ is a finite ring $R(m)$. A field is essentially a set of elements in which we
can perform the simple arithmetic operations of addition, subtraction, multiplication,
and division without leaving the set. A finite field is, in the most basic terms, a field
with a finite number of elements [Ref. 8].

Finite rings have a much weaker structure than do fields. One of their
weaknesses is that not every ring element has a multiplicative inverse, and no
generator exists to generate all the elements of the ring [Ref. 1]. Of special interest is
a ring that is formed with a modulus choice of $2^n$. Obviously a power-of-two modulus
(for $n > 1$) is not prime, so all these implementations are rings. An RNS implementation
can benefit from the choice of an exact power-of-two modulus in that the representation
matches the length of standard words in most computers. For a three-moduli set, this
makes the other two choices quite easy, in that $2^n$, $2^n - 1$, and $2^n + 1$ are all relatively
prime to each other.
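
The relative primality condition of Equation 1.1 is easy to check mechanically. The following Python sketch is an illustration only; the function names `pairwise_coprime` and `power_of_two_set` are hypothetical and do not come from the thesis. It tests a candidate moduli set pair by pair and builds the three-moduli set described above.

```python
from math import gcd
from itertools import combinations

def pairwise_coprime(moduli):
    """Return True if every pair of moduli has gcd 1 (Equation 1.1)."""
    return all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))

def power_of_two_set(n):
    """Build the three-moduli set {2^n - 1, 2^n, 2^n + 1}."""
    return [2**n - 1, 2**n, 2**n + 1]

print(power_of_two_set(4))                    # [15, 16, 17]
print(pairwise_coprime(power_of_two_set(4)))  # True
print(pairwise_coprime([3, 6, 7]))            # False: 3 and 6 share the factor 3
```

The set works for any n because the two odd members differ by 2, so any common factor would have to divide 2, and neither of them shares a factor with the power of two between them.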

2. Moduli Set Choice 

The choice of the moduli is governed by the range of distinct values one 
wishes to represent. The range of the system is determined by the product of the 
moduli.

$$\text{Range} = M = m_r * m_{r-1} * \cdots * m_2 * m_1 \tag{1.2}$$

A system with relatively prime moduli 3, 5, and 7 would have a dynamic range of
105, the product of the moduli. The dynamic range is
the total number of values that can be uniquely represented by a residue number
system. A moduli set must be chosen such that the numbers in the system in which it
is to be used do not typically go out of this range. When an overflow (or underflow)
occurs, the resultant RNS representation is an alias of some other value and cannot
be differentiated from that value. For example, in the {3, 5, 7} RNS
the representation of 106 is the same as the representation of 1, i.e. (1, 1, 1), and would
be interpreted as the value one if a conversion is performed.
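
The aliasing effect can be reproduced in a few lines of Python. This is a minimal sketch, assuming Python 3.8 or later for `math.prod`; the helper name `to_rns` is an illustrative choice, not something defined in the thesis.

```python
from math import prod

def to_rns(value, moduli):
    """Return the residue tuple of value for the given moduli."""
    return tuple(value % m for m in moduli)

moduli = (3, 5, 7)
M = prod(moduli)               # dynamic range, Equation 1.2

print(M)                       # 105
print(to_rns(1, moduli))       # (1, 1, 1)
print(to_rns(106, moduli))     # (1, 1, 1), an alias of 1
```

Any two integers that differ by a multiple of M produce the same residue tuple, which is why overflow out of the range cannot be detected from the residues alone.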

Typically an RNS has been made up of 3 or 4 moduli, but this is in no
way meant as a limitation. One helpful hint is that the largest modulus would best
serve the overall system implementation if it is a direct power of two, as explained
before. The choice of a power of two holds advantages beyond word length;
there is research showing that modulo adders and multipliers for direct
power-of-two moduli can be implemented with significant savings in area
and some gain in speed [Ref. 9].

Basically the idea is to work indirectly on the “residues” instead of directly 
on some larger integer value. By doing this we can reduce the storage requirements 
for intermediate results and take advantage of the rapid addition, subtraction, and 
multiplication of these residual values. 
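
Working on the residues can be sketched as follows. The Python functions below (`rns_add`, `rns_sub`, `rns_mul`, all hypothetical names) operate on each residue independently, with no carry passing between positions; the moduli set {3, 5, 7} is used purely as an example, and the results are only meaningful while the true result stays inside the range 0 to 104.

```python
MODULI = (3, 5, 7)

def to_rns(value, moduli=MODULI):
    """Return the residue tuple of value for the given moduli."""
    return tuple(value % m for m in moduli)

def rns_add(a, b, moduli=MODULI):
    """Add two residue tuples digit by digit; no carries cross positions."""
    return tuple((x + y) % m for x, y, m in zip(a, b, moduli))

def rns_sub(a, b, moduli=MODULI):
    """Subtract two residue tuples digit by digit."""
    return tuple((x - y) % m for x, y, m in zip(a, b, moduli))

def rns_mul(a, b, moduli=MODULI):
    """Multiply two residue tuples digit by digit."""
    return tuple((x * y) % m for x, y, m in zip(a, b, moduli))

x, y = to_rns(9), to_rns(11)
print(rns_add(x, y) == to_rns(20))   # True
print(rns_sub(y, x) == to_rns(2))    # True
print(rns_mul(x, y) == to_rns(99))   # True
```

Each position could be computed by a separate small circuit, which is the source of the parallelism discussed above.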

3. Chinese Remainder Theorem (CRT) 

The Chinese Remainder Theorem (CRT) is the basic building block for all
residue number system development. It is undoubtedly one of the oldest theorems
still in use today. Mathematically, the CRT can be restated by the following theorem
[Ref. 2]:

Theorem 1 Let $m_1, m_2, \ldots, m_r$ be positive integers which are relatively prime in pairs
(as previously stated above). Let $M = m_1 * m_2 * \cdots * m_r$ and let $a, u_1, u_2, \ldots,$ and $u_r$
be integers. Then, there is exactly one integer $u$ which satisfies the conditions
$a \le u < a + M$ and $u \equiv u_j \ (\text{modulo } m_j)$ for $1 \le j \le r$.

The proof of Theorem 1 is as follows:

Proof 1 If $u \equiv v \ (\text{modulo } m_j)$ for $1 \le j \le r$, then $u - v$ is a multiple of $m_j$ for all $j$,
so Equation 1.1 implies that $u - v$ is a multiple of $M = m_1 m_2 \cdots m_r$. This argument
shows that there is at most one solution to $a \le u < a + M$. As $u$ runs through the
$M$ distinct values $a \le u < a + M$, the r-tuples $(u \bmod m_1, \ldots, u \bmod m_r)$ must also
run through $M$ distinct values, since the conditions of Theorem 1 have at most one solution. But there
are exactly $m_1 m_2 \cdots m_r$ possible r-tuples $(v_1, \ldots, v_r)$ such that $0 \le v_j < m_j$. Therefore
each r-tuple must occur exactly once, and there must be some value of $u$ for which
$(u \bmod m_1, \ldots, u \bmod m_r) = (u_1, \ldots, u_r)$.

The CRT is the starting point for all RNS work, although some other techniques have 
been tried. 
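
The proof above shows only that a unique solution exists. A common constructive way to find it, sketched below in Python, weights each residue by the product of the other moduli times a modular inverse. This is the standard textbook reconstruction and not necessarily the method used later in the thesis; the function name `crt` is hypothetical, and the three-argument `pow` with exponent -1 assumes Python 3.8 or later.

```python
from math import prod

def crt(residues, moduli):
    """Reconstruct the unique u in [0, M) with u = residues[j] (modulo moduli[j])."""
    M = prod(moduli)
    total = 0
    for u_j, m_j in zip(residues, moduli):
        M_j = M // m_j                  # product of all the other moduli
        inverse = pow(M_j, -1, m_j)     # exists because the moduli are pairwise coprime
        total += u_j * M_j * inverse
    return total % M

# Sun Tzu's puzzle: remainders 2, 3, 2 when counting by threes, fives, sevens.
print(crt((2, 3, 2), (3, 5, 7)))        # 23
```

For the moduli {3, 5, 7} the three weights work out to 70, 21, and 15, which are the numbers that appear in Hsin Tai-Wei's verse.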

4. Mixed Radix Representation 

Another form of representation for RNS is called the mixed radix repre- 
sentation (or system). An advantage of this form of representation is that it is a 
weighted format such that comparisons may be performed without further conver- 
sion. One method for performing the conversion from RNS to a mixed radix form is 
described in the following equations. 

$$v_1 = u_1 \ (\text{modulo } m_1) = u_1 \tag{1.3}$$

$$v_2 = [(u_2 - v_1) * c_{12}] \ (\text{modulo } m_2) \tag{1.4}$$

$$v_3 = [((u_3 - v_1) * c_{13} - v_2) * c_{23}] \ (\text{modulo } m_3) \tag{1.5}$$

$$v_r = (\ldots((u_r - v_1) * c_{1r} - v_2) * c_{2r} - \cdots - v_{r-1}) * c_{(r-1)r} \ (\text{modulo } m_r) \tag{1.6}$$

$$U = v_r m_{r-1} \cdots m_1 + \cdots + v_3 m_2 m_1 + v_2 m_1 + v_1 \tag{1.7}$$

These equations describe the conversion process, where the $u_i$’s are the original RNS
representation values, the $v_i$’s are the mixed radix values, and $U$ is the fully converted
value in some decimal or binary form. The conversion constants $c_{ij}$ are
calculated from Euler’s equation.

$$c_{ij} m_i \equiv 1 \ (\text{modulo } m_j) \tag{1.8}$$

It is important to note again that the $v_i$ form of the mixed radix is a
weighted number and may be compared directly to another value. The format of
Equations 1.4-1.8 illustrates how each value $v_i$ depends on the preceding value,
$v_{i-1}$, and all earlier values of $v$. Due to this cascading dependency on previously
calculated values, conversion into this type of representation lends itself well to a
pipelined form of conversion. Full conversion to the value $U$ is not required if only
a comparison is desired, and could be enabled or disabled as necessary. For example,
using the moduli set {7, 5, 3}, the mixed radix forms of the numbers 35 and
23 are (2, 1, 2) and (1, 2, 2) respectively, while in RNS they would be (0, 0, 2) and
(2, 3, 2). Looking at the mixed radix form it is obvious which represented number
is larger, but this is not true for the RNS representation; in fact the value for 23
“appears” to be larger than the value for 35. The pipeline implementation is
investigated in more detail in Chapter III.
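
The conversion of Equations 1.3 through 1.8 can be traced with a short Python sketch. It is an illustration under stated assumptions rather than the thesis design: the moduli are passed in the order m_1, m_2, m_3 = 3, 5, 7 (so that the weights of Equation 1.7 reproduce the digits quoted above), the constants c_ij are computed on the fly with Python 3.8's `pow(m_i, -1, m_j)`, and the function names are hypothetical.

```python
def mixed_radix(residues, moduli):
    """Convert residues (u_1..u_r) to mixed radix digits [v_1, ..., v_r].

    moduli are given as (m_1, ..., m_r); each digit follows Equations 1.3-1.6.
    """
    digits = []
    for j, m_j in enumerate(moduli):
        t = residues[j]
        for i in range(j):
            c_ij = pow(moduli[i], -1, m_j)      # Equation 1.8: c_ij * m_i = 1 (modulo m_j)
            t = ((t - digits[i]) * c_ij) % m_j  # peel off one earlier digit
        digits.append(t % m_j)
    return digits

def from_mixed_radix(digits, moduli):
    """Evaluate Equation 1.7: U = v_1 + v_2*m_1 + v_3*m_2*m_1 + ..."""
    total, weight = 0, 1
    for v, m in zip(digits, moduli):
        total += v * weight
        weight *= m
    return total

MODULI = (3, 5, 7)                               # m_1, m_2, m_3
for value in (35, 23):
    u = tuple(value % m for m in MODULI)
    v = mixed_radix(u, MODULI)
    # Print most significant digit first, as in the text.
    print(value, tuple(reversed(v)), from_mixed_radix(v, MODULI))
# 35 (2, 1, 2) 35
# 23 (1, 2, 2) 23
```

Because the digits are weighted, comparing the tuples most significant digit first gives the same answer as comparing the integers, and the inner loop makes visible the dependency of each digit on all earlier ones.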

5. Redundant Residue Number Systems 

Redundant residue number systems (RRNS) are defined as residue number 
systems with additional redundant moduli. A choice is made of n moduli, called the 
nonredundant moduli as in any residue number system, with an additional r relatively
prime redundant moduli. The extra r moduli are not considered in the calculation of
the range M of the system. The system’s legitimate range remains the product of
the nonredundant moduli, as in Equation 1.9.

$$M = \prod_{i=1}^{n} m_i \tag{1.9}$$

The additional redundant terms form a product to define the illegitimate range as
shown in Equation 1.10.

$$R = \prod_{i=n+1}^{n+r} m_i \tag{1.10}$$

The overall number of unique values that can be represented is indicated by

$$MR = \prod_{i=1}^{n+r} m_i \tag{1.11}$$

which includes the redundancy R [Ref. 10].

The following is an example of how a number could be represented using 
a RRNS implementation. Using the familiar RNS with moduli {7, 5, 3} the decimal 
number 23 is represented by the three-tuple (2, 3, 2). Adding the relatively prime 
redundant moduli of {8, 11} results in two additional terms (7, 1). Putting it all 
together we have the five-tuple (2, 3, 2, 7, 1). 
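
The five-tuple above, and the way the redundant residues expose an error, can be sketched in Python. This is an illustrative model only: detection here is done by reconstructing the value over all five moduli and checking whether it falls inside the legitimate range M = 105, which is one standard approach and not necessarily the algorithm of [Ref. 10]. The names `to_rrns`, `decode`, and `looks_legitimate` are hypothetical, and `pow` with exponent -1 assumes Python 3.8 or later.

```python
from math import prod

NONREDUNDANT = (7, 5, 3)
REDUNDANT = (8, 11)
ALL_MODULI = NONREDUNDANT + REDUNDANT

def to_rrns(value):
    """Residues for the nonredundant moduli followed by the redundant ones."""
    return tuple(value % m for m in ALL_MODULI)

def decode(residues):
    """Reconstruct the value over all moduli (constructive CRT)."""
    total_range = prod(ALL_MODULI)               # MR of Equation 1.11
    total = 0
    for u, m in zip(residues, ALL_MODULI):
        partial = total_range // m
        total += u * partial * pow(partial, -1, m)
    return total % total_range

def looks_legitimate(residues):
    """True when the decoded value lies in the legitimate range [0, M)."""
    return decode(residues) < prod(NONREDUNDANT)

word = to_rrns(23)
print(word)                          # (2, 3, 2, 7, 1)
print(looks_legitimate(word))        # True

corrupted = (3,) + word[1:]          # single error in the first residue
print(decode(corrupted))             # 2663, outside 0..104
print(looks_legitimate(corrupted))   # False
```

With these moduli a single corrupted residue always decodes to a value outside the legitimate range, because two legitimate values can never agree in four of the five residues.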

Using redundant moduli allows for greater error checking and correction 
capabilities, thereby making the overall system more fault tolerant. Watson and 
Hastings have done research on RRNS that detect any errors in the residues and 
correct one of them [Ref. 10]. There are also algorithms for burst error detection 
and correction available for RRNS implementation [Ref. 11]. This capability makes
RNS methods even more desirable when a strong degree
of fault tolerance is required. The RNS with quotient implementation introduced in
Chapter II is a form of redundant system, only it is not formed using extra moduli.
6. VLSI Overview 



There are many tradeoffs to be considered in any engineering design pro- 
cess. Primary design considerations when undertaking a VLSI (very large scale 
integration) design are modularity, regularity, area, and development time. These 
are not the only items to consider, but they form the cornerstones for a good design 
approach and are in keeping with the spirit of VLSI. 

Modularity is a concept that takes into account, to some extent, the ease 
of mobility of a functional block within a given overall circuit and also its value in 
other circuit implementations. If a VLSI module is “well formed” the interaction it 
undergoes with other parts of the circuit can be easily and succinctly characterized. 
A highly modular circuit can be thought of as a properly written software subrou- 
tine. The subroutine can be embedded or called by a variety of main programs or 
other subroutines and only depends on what variables are passed into it and what 
variables it must return to the calling program. Likewise, the calling program need 
not be concerned with the internal operations of the subroutine, only that it returns 
the desired result when needed. A poorly written subroutine would rely on global 
constants or variables within a given program, thus making it, in its present form, 
highly immobile and therefore not very modular. Modularity in VLSI design is basically
the same concept as modularity of subroutines. The primary difference is that the interface in
the circuit design is a physical boundary that must be connected, rather than a passing of
variables as in a subroutine [Ref. 12].

Regularity in a VLSI design is important to both the speed of development 
and also to the modularity. Optimizing each and every functional element in a circuit 
may result in a significant savings in total silicon area used and may also yield the 
highest speed of operation. The drawback to this approach is the long design time 
and the lack of a guarantee of a modular circuit. Using standard cells that have been 
set up in an on-line library and using an iterative process for interconnection of the 
cells to form a functional block generally leads to faster design time and a modular 
cell. This is where the cost function AT (area * time delay) must be considered when 
deciding on a full custom design or a high degree of regularity [Ref. 13]. 

Area and time considerations for a VLSI design are fairly closely interlocked 
as evidenced from the previous paragraphs. To accomplish a particular design in 
the least amount of space implies a full custom design with every circuit optimized. 
Implementation of a circuit in the least amount of time leads one to rely entirely 
on library cells for circuit realization. There is also the speed of the circuit to be 
considered; it must be fast enough to be compatible with the system that will be using 
it. If a full custom circuit enjoys mass production and must perform at the highest 
speeds possible, then the time required to develop it may be justified. However, if it 
is a limited production circuit without any serious speed requirements or it is a design 
prototype to investigate feasibility for implementation, then the quickest development 
time is preferable. 

7. Programmable Logic Arrays (PLA) 

PLAs are highly regular and modular VLSI structures. They are composed 
of an AND gate array followed by an OR gate array. The PLA is a subset of the ROM 
structure [Ref. 14]. PLAs allow the VLSI designer to implement any combinatorial 
logic function that may be characterized by a sum of products (SOP). An example
of a SOP is given in Equation 1.12 below.

X = ABCD + ABCD + ABCD (1.12) 

One primary difference between a ROM and a PLA is that a ROM can implement any
combinatorial function desired, given the number of inputs and outputs, whereas the PLA
requires that the function to be implemented be expressed as a SOP. The advantage of
using a PLA over a ROM is that there may be a substantial savings in
silicon area. Some ROM structure applications
have many unused output functions. The ROM may be thought of as a post office
that maintains mailboxes for all of the people on the mailing list but only has mail for
a small portion of those addressees on any given day. The mailboxes are always there
but only a few are used regularly. Using a ROM to realize 8 functions of 16 variables
requires a structure of 65,536 8-bit words [Ref. 15]. This same function may well be
realized at substantial area savings using a PLA and SOP minimization techniques.

The facts outlined above concerning design modularity and regularity, coupled
with the available tools for SOP minimization and PLA realization, led to the choice
of the PLA implementation that is investigated in this thesis. The tools used were
espresso, eqntott, mpla, magic, and ml. These tools are all described in “Still More
Works by the Original Artists” [Ref. 16]. After considering other approaches, this
seemed to be the best way to minimize the overall development cost and was in keeping
with the regularity advantage of VLSI design.

D. THESIS OVERVIEW 

This thesis concentrates on investigating traditional comparison methods and 
offering some alternative solutions to this problem. Background work has been pre- 
sented to develop RNS basics. Chapter II investigates conventional comparison meth- 
ods, proposes new techniques for conversion, and analyzes the different comparison 
techniques in order to determine the overall efficiency of each method. Comparison 
efficiency is driven by the speed at which this operation can be performed and by 
the overall savings offered by the modulo system as compared to current methods. 
Conversion is a limiting factor if the values being used require frequent manipulation
in a weighted or conventional format, which makes the operations of division, scaling,
and comparison easier to perform. Chapter III concentrates on the implementation
of the alternative methods that have been proposed. This includes the steps taken to
realize the design in a VLSI layout and design verification. Conclusions and recommendations
for future study are presented in Chapter IV.
II. ANALYSIS 



The traditional conversion method was previously alluded to in Chapter I and 
is examined in more depth in this chapter. It is to be analyzed in terms of its good 
points and drawbacks. Proposed alternatives are introduced and analyzed as far 
as what benefits can be derived from their implementation. Comparisons are made 
between both the new alternative solutions and the conventional method. Design 
tradeoffs between the different methods are also discussed. 

A. NAIVE SOLUTION 

ROM table look-ups are the most commonly used method for converting an 
RNS value to some other form to facilitate comparison. The term “naive solution” 
is not meant to imply that designers are naive for choosing this solution to the com- 
parison problem. It is used to suggest that this is the most straightforward solution 
and is simply implemented. The overall cost function, which lies at the root of any
VLSI implementation, has not been fully evaluated in considering this alternative.
There are some valid reasons for utilizing this approach, and they will be examined. 

Substantial drawbacks to ROM look-up tables lie in the fact that their size grows
with the dynamic range of the moduli set being used, that ROMs are relatively slow devices,
and that typical implementations require the use of multiple ROM tables. A multiple
table look-up is given as an example in Figure 2.1 [Ref. 17]. Once the conversion
has been accomplished, there is still the matter of performing the comparison, which
is handled by traditional comparator circuitry. The circuitry in Figure 2.1 forms, as
an intermediate step, the mixed radix representation of the residue number. This
could be further optimized by imposing conditions so that when only a comparison is
desired the mixed radix value could be used for the comparison without waiting on the
full conversion. Doing this would eliminate the time delay of full conversion
when it is not actually needed. Full conversion could be completed if the value is actually
required for some external purpose.

Figure 2.1: Example of a ROM Based RNS to Binary Converter [Ref. 17]

B. PROPOSED ALTERNATIVE SOLUTIONS 

The following three sections introduce alternatives to the “naive solution.” Only
one of them (the RNS-QD method) has been implemented and verified with a VLSI
layout; the other two have only been developed theoretically.

1. RNS with Quotient 

This alternative uses the traditional RNS implementation with the additional
element of the quotient with respect to the largest modulus being part of the system. The
name RNS with quotient (RNS-Q) comes from the fact that this quotient is now part
of the representation. The advantage of this concept is that the quotient for any modulus
together with its residue forms a unique value, and these pairs are “ordered” in terms of magnitude, thus
allowing direct comparisons to be accomplished. Conversions are only required with
this system when a traditional value is required for output or some other use, but
not to perform comparisons. The comparisons can be accomplished using traditional
methods with “off the shelf” components.

The motivation for using the RNS-Q method is derived from Theorem 2. 
This theorem and its proof show the validity of such a representation. 

Theorem 2 Let $M = m_r \cdots m_2 m_1$ with $m_r, \ldots, m_2$, and $m_1$ all relatively prime.
Then, the representation $(i/m_j, i \bmod m_r, \ldots, i \bmod m_2, i \bmod m_1)$ is unique for
any $i \in [0, M - 1]$.

Proof 2 Let $a \neq b$ with $a, b \in [0, M - 1]$, using an $r$ moduli system with moduli
$m_r, \ldots, m_2$, and $m_1$, with $M = m_r \cdots m_2 m_1$. The only parts of the representation in doubt are
the quotients $a/m_j$ and $b/m_j$, because the rest of the representation is the traditional RNS,
which is unique as a consequence of the CRT. If $a/m_j$ and $b/m_j$ are not unique
representations, which they may well not be, the uniqueness of the remaining portion
of the value preserves the uniqueness of the entire representation. QED.

RNS-Q representation may be thought of as the quotient serving as a redundant value, 
one that is not required to ensure uniqueness, but allows for immediate comparison 
without conversion. The RNS-Q system also forms a partition or an equivalence class
of the set $[0, M - 1]$ which is called the quotient set $M/m_i$ [Ref. 18].

Drawbacks to this implementation are that extra bits are required to retain 
the quotient of the largest modulus, the system is no longer carry-free, and there is 
a potential loss of some of the fault tolerance of the traditional RNS. There will also 
be the extra burden of checking for overflow out of or underflow into the largest
modulus in order to update the quotient properly.

Loss of the carry-free characteristic poses the largest potential problem. 
Carry-free loss is due to the fact that whenever any arithmetic operation is performed 
on the RNS-Q numbers overflow out of, or underflow into, the largest residue must 
be reflected in the quotient also. With true RNS there is no requirement to check for 
the occurrence of overflow or underflow during normal operations. Traditional RNS 
does suffer from an aliasing problem when underflow or overflow occur. Failure to 
account for overflow and underflow will result in an invalid quotient. Overflow from 
the largest residue indicates that the quotient must be incremented by one, while 
underflow implies we must decrement the quotient once. 
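
The quotient update described above can be modelled in Python. This is a hedged sketch of one way the bookkeeping could work, not the thesis circuit: a value is stored as (quotient of the largest modulus, then the residues), the moduli set {5, 3, 2} is borrowed from the example that follows, and the names `to_rnsq`, `rnsq_add`, and `rnsq_sub` are hypothetical. Subtraction assumes the first operand is at least as large as the second, and neither routine checks that the result stays inside the system range.

```python
MODULI = (5, 3, 2)                 # largest modulus first
LARGEST = MODULI[0]

def to_rnsq(value):
    """Return (quotient of the largest modulus, r5, r3, r2)."""
    return (value // LARGEST,) + tuple(value % m for m in MODULI)

def rnsq_add(a, b):
    """Add two RNS-Q values; overflow of the largest residue bumps the quotient."""
    (qa, *ra), (qb, *rb) = a, b
    carry = (ra[0] + rb[0]) // LARGEST          # 0 or 1
    residues = tuple((x + y) % m for x, y, m in zip(ra, rb, MODULI))
    return (qa + qb + carry,) + residues

def rnsq_sub(a, b):
    """Subtract b from a (a >= b); underflow of the largest residue borrows."""
    (qa, *ra), (qb, *rb) = a, b
    borrow = 1 if ra[0] < rb[0] else 0
    residues = tuple((x - y) % m for x, y, m in zip(ra, rb, MODULI))
    return (qa - qb - borrow,) + residues

print(rnsq_add(to_rnsq(7), to_rnsq(8)) == to_rnsq(15))   # True
print(rnsq_sub(to_rnsq(15), to_rnsq(8)) == to_rnsq(7))   # True
```

The `carry` and `borrow` terms are exactly the dependency that makes RNS-Q no longer carry-free: the quotient cannot be updated until the largest residue position has been examined.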

Table 2.1 outlines an RNS implementation using the relatively prime moduli
set {2, 3, 5}. This system possesses a dynamic range of 30 values (0-29), and was
chosen to illustrate some of the different representations that can be made. Shown
are the traditional RNS, the quotients of the largest modulus, RNS-Q, and the mixed
radix representation. It is easy to see that the traditional RNS does not yield the nicely
weighted system of numeric values that we are accustomed to working with. Mixed
radix and RNS-Q are also easily verified as ordered number systems, which shows that
they may be compared without further manipulation.

TABLE 2.1: Illustration of RNS Implementations

It is obvious from Table 2.1 that {4, 2, 1, 0} is greater than {3, 3, 0, 0}.
Our common sense would also tell us that {3, 0, 0} is greater than {2, 1, 0}, which
is not true in the traditional RNS representation. The RNS-Q entity has the “look
and feel” of a traditional weighted decimal system (when ordered: $q_5, r_5, r_3, r_2$).
This is because the quotient carries the most weight, just as a digit in the hundreds
column of a decimal number carries more weight than a digit in the tens column.
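
The ordering property can be checked for the whole {2, 3, 5} system with a short Python sketch. The helper names `rns` and `rnsq` are hypothetical, and the tuples are ordered (r5, r3, r2) and (q5, r5, r3, r2) to match the examples in the text; the rows are regenerated from the definitions and are not copied from Table 2.1.

```python
MODULI = (5, 3, 2)                     # ordered r5, r3, r2

def rns(value):
    """Traditional RNS tuple (r5, r3, r2)."""
    return tuple(value % m for m in MODULI)

def rnsq(value):
    """RNS-Q tuple (q5, r5, r3, r2)."""
    return (value // 5,) + rns(value)

values = list(range(30))

# Sorting by the RNS-Q tuple reproduces numeric order; sorting by RNS does not.
print(sorted(values, key=rnsq) == values)   # True
print(sorted(values, key=rns) == values)    # False

# The two pairs discussed in the text are the values 22 and 18.
print(rnsq(22), rnsq(18))                   # (4, 2, 1, 0) (3, 3, 0, 0)
print(rns(22), rns(18))                     # (2, 1, 0) (3, 0, 0)
```

Comparing the tuples element by element from the left is all that is needed for RNS-Q, and in fact the first two elements (q5, r5) already decide every comparison.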

The greatest advantage derived from an implementation of the RNS-Q 
system is that no conversion is necessary for comparison. Complete conversion can 
be accomplished by using a multiplier and an adder to multiply the quotient and its 
modulus and sum that result with the associated residue. RNS-Q systems were not 
implemented or tested. Implementation was not performed because an overall system 
would have to be developed to enable design tradeoffs between other methods. 

2. RNS with Quotient on Demand (RNS-QD) 

RNS-QD is based on the principles outlined for the RNS-Q representation 
given in the preceding section. The principle in RNS-QD is that the quotient is 
looked up when needed, hence the name RNS quotient on demand. Implementation 
of this system utilizes the advantages of the RNS-Q representation while maintaining 
the inherent strengths of the conventional RNS of being carry-free and fault tolerant.
The loss of these traits was the principal drawback of the RNS-Q system.

Look-up tables were considered a disadvantage in the traditional RNS
implementation. This was due to the fact that the tables grew in proportion to
the range of the system and were typically implemented in ROM-based structures.
The fact that each RNS value was a unique representation meant that very little minimization of
the number of sum-of-product terms required for implementation was possible. ROM-based
table look-up systems also made poor use of the available silicon area, as described
in the introduction, unless every available memory site is required. RNS-QD systems
can be implemented so as to overcome some of these drawbacks.

PLA structures can make more efficient use of available silicon real estate
than a similar realization based on ROMs, as long as the function can be
expressed as a sum-of-products (SOP). This is true for both the RNS and RNS-QD systems.
The advantage of the RNS-QD system is that substantial minimization
can be gained from the SOP terms used to define the function. Quotients derived
from any of the moduli are not unique to each value in the range of the system. The
number of quotients for each modulus is determined by the product of the other moduli
of the system, and the quotients form a partition of the set of all values in the range
[Ref. 17]. An RNS with moduli set {3, 5, 7} has 15 different quotients (0 - 14) for the
modulus seven, 35 quotients for three, and 21 quotients for five in the system range of
105. This also illustrates why selection of the largest modulus limits the number of
carries required in an RNS-Q implementation.

Looking at the {3, 5, 7} system implementation, we should choose 7 as the
modulus for which the quotient table will be generated. This means that
fifteen different quotients (3 * 5) are required to cover the range 0 - 104.
Ideally, we could hope that the number of logic equations required to implement the
table would reduce to fifteen. Although this is not the case, fifteen is the number of
output terms that we need to complete the quotient index.

3. Pipelined Mixed Radix Conversion

Fundamentally, a pipeline in a computer system is formed when a large
task is broken up along natural boundaries into sub-tasks. This is most frequently
encountered in the execution of instructions within a central processing unit, but can
also be found in floating point processors and other arithmetic devices. Breaking up
a task into smaller units, called stages, allows the clock cycle to be shortened such
that all stage outputs are completed at the end of the cycle. The minimum clock cycle
duration is limited by the speed of the slowest stage. Cascading the stages together
forms what is referred to as a pipe. Generally speaking, the latency from entering the
pipe to exiting the pipe is longer than if the task were not subdivided. The idea is
that, as long as one instruction enters the pipe every clock cycle, the rate of
instruction execution will theoretically approach one per cycle over long periods of
time [Ref. 19].

Conversion from RNS to mixed radix is an easy task, but the following 
example should clear up any doubts and also illustrate the cascading nature of the 
conversion. The first step is to calculate the conversion constants $c_{ij}$ for the moduli
set being used. Conversion constants are obtained using Equation 1.3 and, once they
have been calculated for a given moduli set, they never have to be recalculated. Using
the moduli set {5, 3, 2} (with $m_3 = 5$, $m_2 = 3$, and $m_1 = 2$), the conversion constants
are $c_{12} = 2$, $c_{13} = 3$, and $c_{23} = 2$. Proceeding with the rest of the conversion, the
RNS value (4, 1, 1) was chosen to be converted. Using Equations 1.4 through 1.7, the
values of the $v_i$'s can be obtained from the $u_i$'s as follows:

$v_1 = u_1 \bmod m_1 = u_1 = 1$

$v_2 = (u_2 - v_1)\, c_{12} \bmod m_2 = [(1 - 1) * 2] \bmod 3 = 0$

$v_3 = [((u_3 - v_1)\, c_{13} - v_2) * c_{23}] \bmod m_3 = [((4 - 1) * 3 - 0) * 2] \bmod 5 = 3$

The mixed radix representation for the RNS value (4, 1, 1) is (3, 0, 1). A program
can easily be written to perform this function automatically given the input
moduli set.

Mixed radix conversion lends itself well to this type of implementation because,
by the nature of the conversion, each step depends in order on all of the preceding
steps. This is evident from Equations 1.3 to 1.7 in Chapter I. Systems that perform
large numbers of conversions, or that perform many in bursts, can benefit from
pipelined structures. There are several stages that can be cascaded to form the pipe,
and the length of the pipe is in direct proportion to the number of moduli in the system.
Individual stages are also fairly modular and can be built with good regularity, which
lends itself to the VLSI environment. The final stage, needed for full conversion,
is not always required if only a comparison is to be done. The final stage could be
enabled or disabled to reduce pipeline latency in situations where full conversion
is not required.

III. IMPLEMENTATION



Circuitry for the RNS-QD system and the straightforward full conversion look- 
up tables were designed and implemented using programmable logic arrays. The 
following sections provide a detailed description of the design process. 

A. GETTING STARTED 

The first step in any engineering design process is choosing the method of
implementation that best fits the task. This consisted initially of deciding how best
to generate differing layouts for the purpose of determining which was more efficient.
Generating layouts in VLSI can be extremely time consuming if a full custom
realization is desired. The decision was made to implement the traditional
RNS with a full conversion look-up table and the RNS-QD with a quotient look-up
table, both utilizing PLAs. This was based on the fact that these designs could be
easily accomplished, were very modular, and could be easily contrasted as to which was
more efficient. RNS-Q and the pipelined MRC systems are of the type whose
efficiency would be best demonstrated in a full system implementation, and they are
not easily compared to the other two solutions.

To realize a PLA design, one must develop the necessary logic equations, minimize
the initial equation set, and generate the PLA layout. Several design tools are
available to aid the designer. Espresso is a design tool for logic equation minimization
that is part of the magic VLSI computer assisted design (CAD) tool [Ref. 16].
Utilization of espresso greatly reduces the development time required of the designer
and in some cases makes an impossible problem reasonable. The output generated by
espresso is formatted for direct use by the tool mpla. Mpla generates a PLA layout
automatically when given the SOP equations that must be realized. Without a tool
such as mpla, the designer could proceed with a PLA design by laying out cells for
basic AND and OR gates. The AND and OR gate cells could be connected together
in the proper sequence so as to realize the same logic equations as done by mpla, but
this is more time consuming. Only one tool was missing: a tool to generate
the logic equations. The programming environment chosen was C. This
choice was based on the need for rapid prototyping and the simplicity of the programs
that were required.

The first program written was intended to verify the structures of RNS, in the
hopes of providing some additional insight into the interplay between the relatively
prime moduli of the system. Further refinement of this entry-level program resulted in
versions that created RNS-Q type systems for graphic verification of the uniqueness
theorem introduced in Chapter II. Another program modification yielded mixed radix
representations of RNS, which allowed illustration of the fact that this is an “ordered”
system. Final refinement resulted in two programs that generate logic equations in the
format required by espresso, the minimization tool to be used.
One program generates equations for a traditional RNS table look-up conversion
(cnvrtres.c), the other for the RNS-QD quotient look-up table (qlugen.c). Both
are contained in Appendix A.

B. IMPLEMENTATION PROCESS 

Once the programs were generating the logic equations properly, the question of
which systems should be built arose. The decision was
made to test several implementations that had similar dynamic ranges. This enabled
checking of the hypothesis that choosing a largest modulus significantly larger than
the other moduli yields savings, because fewer quotients are needed in the table.
Direct power-of-two moduli were also checked to investigate whether the savings that
had been realized for power-of-two adders and multipliers [Ref. 9] carry over.
One obvious pitfall of this approach is that the dynamic ranges of two
differing residue number systems will not be identical, but they can be close enough.
A dynamic range is chosen large enough that there will be little chance of overflow
out of the range. Dynamic range requirements
are a floor, or minimum setting, much the same as the number of bits needed in
a computer. If the minimum word length needed is eight bits but a 16-bit
machine is available, there are advantages to using the larger word size, and doing
so places no limitations on the requirements.

1. Initial Test 

The RNS sets that were initially evaluated were based on the requirement for
a minimum dynamic range of 105. This allowed for use of the previously mentioned
system of relatively prime moduli {3, 5, 7}. Two other systems were chosen for
comparison that had a dynamic range greater than 105. These two systems have
moduli of {3, 5, 8} and {2, 5, 11}, with dynamic ranges of 120 and 110 respectively.
The assumption on dynamic range for this implementation is that the system requires
a minimum range of 100. Another point of interest is that these three systems
all have the same number of input (8) and output (4) bits. Maintaining an
equal number of I/O bits is not necessarily a requirement, but it keeps the systems
comparable and, as a result, did not become a point of contention when
the final evaluation was made. Implementation of these systems was done for both
the traditional RNS and the RNS-QD methods to allow contrasting the costs of
realization.

The implementation was realized using PLAs. The minimized logic equations
from espresso were fed directly to mpla for VLSI layout.
Minimization by espresso resulted in an average reduction of 20 - 30% of the original
logic equations generated by the C program for the RNS-QD system. Virtually no
minimization was realized when espresso was used on the logic equations for the
traditional RNS conversion look-up table. This is not surprising: the RNS values are
unique, and they correspond to unique integers, so
one should not expect any reduction in size. An exception was
discovered when a direct power-of-two modulus was implemented (except where the
modulus was 2 itself). In this case, there was some logic minimization,
but it was still less than what was obtained for the quotient look-up table. This
minimization contributed to reducing the width of the PLA, which in essence reduces
the number of terms that must be fed to the OR gate plane. The majority of the
minimization that occurred in the quotient look-up table accounted for reductions in
the height of the PLA, or the number of AND gates required for realization. There
was also some reduction in the width of these structures for the RNS-QD format.
Reduction in the height of the PLA has the greatest effect on the overall area. Despite
the slight reduction in width, the direct power-of-two conversion PLA still showed
an overall growth in total area over the previous smaller implementation. Figure 3.1
shows the differences in overall area of the RNS versus the RNS-QD implementations.
The average savings in area gained by using the RNS-QD system is 30%.

Figure 3.1: Comparison of RNS and RNS-QD PLA Implementation

2. In-Depth Testing 

After performing this initial test, there was a desire to attempt a more
interesting problem. The University of Florida has been working on a pipelined mixed
radix converter for the moduli set {101, 109, 113}, and it would be interesting to
compare results [Ref. 20]. Using the same C programs to develop the logic equations
seemed easily accomplished. However, generating the complete set consumed all the
available disk space and thereby shut down the ECE Vax. An examination of the espresso
program also seems to indicate that a maximum of 3000 stack pushes are available
before an error is received. This would have been exceeded by an attempt at
minimization of the logic equations for this system, which number a little over one
million.

To investigate some moduli sets with larger dynamic ranges, and to avoid 
computation problems, another set of moduli was chosen of approximately ten times 
the dynamic range of the first set. The sets of relatively prime moduli are {3, 5, 64}, 
{7, 11, 16}, and {3, 11, 32}. Use of these moduli sets gives a dynamic range of 960 
- 1232. The choice of these sets was driven by the desire to further investigate the 
ideas of the “much larger moduli” hypothesis for savings in area, and the power of 
two savings that had been realized in adder and multiplier implementations [Ref. 9]. 

This time the generation of the logic equations was accomplished without
putting any heavy strain on the Vax’s available disk space. Minimization by espresso
was fairly time consuming but was completed in about two hours for each set of
moduli. Full PLA implementation was not required, since previous results indicated
that the reduction in the number of logic equations is a good indicator of the
size of each PLA. The results are tabulated in Table 3.1. From Table 3.1, it is easy
to see that the most significant reduction came from the implementation of the set
{3, 11, 32}. Earlier work would have led us to believe, or at least hope for, the
greatest reduction in the set {3, 5, 64}. What Table 3.1 does not show is that the
number of output lines for the {3, 5, 64} set is four, while six output lines are
required for implementation of the {3, 11, 32} representation. Input and output lines
had remained constant throughout the initial test set and were not variables to be
evaluated. Decreasing the number of routing lines required is always welcome, and
may be more significant than the extra area consumed by the larger PLA. The PLA
may even be smaller, because the system with the largest quotient modulus needs
fewer quotients as outputs and therefore less output bandwidth.

TABLE 3.1: Results of Minimization By Espresso

Moduli Sets     # of Original Equations     # of Reduced     Percent
                & Dynamic Range             Equations        Reduction

{7, 11, 16}     1232                        828              33%
{3, 11, 32}     1056                        511              52%
{3, 5, 64}      960                         555              42%
{3, 13, 29}     1131                        831              26%
{3, 11, 31}     1023                        556              46%
{3, 11, 34}     1122                        556              50%



The actual PLA implementation was done by sending the reduced equation
set to the VLSI tool mpla for automatic realization. The results of this step are
contained in Table 3.2. The dimensions given for height, width, and area are in
microns for the technology being used. Three micron technology is being used, so the
linear dimensions would be divided by three (and the area by nine) to obtain the size
using one micron technology. Table
3.2 shows that the {3, 11, 32} implementation uses the least area, as
predicted by the fact that it requires the fewest logic equations. The
{3, 5, 64} system is only slightly larger than the {3, 11, 32} system and is not as
wide. The width of a PLA is an indication of the number of gate delays from input to
output. Although timing analysis was not performed on these circuits, fewer gate
transitions imply faster operation of the circuit.

Testing was performed to investigate the power-of-two advantages that
had been discovered previously [Ref. 9]. The implementations tested were for moduli
sets {3, 13, 29}, {3, 11, 31}, {3, 11, 34}, and the previously implemented system
{3, 11, 32}. The dynamic ranges, percent minimization, and area cost statistics are
shown in Tables 3.1 and 3.2. Once again the system {3, 11, 32} holds an advantage
over the other implementations, but the {3, 11, 31} system is a close second. Full
implementation of the {3, 11, 31} system will require much more total area, because
the residue multipliers and adders are much larger [Ref. 9]. It appears that there is
fairly linear growth in the size of the PLA structure away from a direct power-of-two
implementation, in both the positive and negative directions. The choice in this case
would be the system with moduli set {3, 11, 32}, for the best overall savings.

TABLE 3.2: In Depth Test Results

Moduli Sets     Height     Width     Total Area

{7, 11, 16}     7068       462       3265416
{3, 11, 32}     4388       390       1711320
{3, 5, 64}      4762       382       1819084
{3, 13, 29}     7092       438       3106296
{3, 11, 31}     4794       382       1831308
{3, 11, 34}     4770       398       1898460

C. DESIGN VERIFICATION 

To completely authenticate a design, some type of verification must be
performed to guarantee its validity. Verification for the implementation of the RNS-QD
and traditional RNS PLA layouts was done with the use of a tool called RNL
[Ref. 21]. RNL is a timing logic simulator for digital MOS circuits. It is an
event-driven simulator that uses a simple resistance-capacitance model of the circuit
that has been extracted from the VLSI layout done in magic. RNL allows for verification
of the device of interest by using timing files and node transitions as inputs. There
are other simulation tools that can be used if a more detailed circuit analysis is
desired, but the goal of this testing is to verify that, for a given set of input
vectors, the proper outputs are received.

Design verification was performed on the first few PLAs implemented. A sample
RNL file is contained in Appendix C that includes timing, clock speeds, inputs, and
outputs received for the system {2, 3, 5}. After ensuring that the logic equation
generating programs functioned properly, further design verification was not performed.
Worst case timing analysis was not accomplished, but could also be done using RNL
and the glitch detector [Ref. 21].

IV. CONCLUSIONS



Investigation into the usefulness of RNS in computer arithmetic has been going
on for some time. One of the principal drawbacks has been the difficulty in comparing
two residue numbers without conversion to some other form. The traditional method
for comparison is a ROM-based table look-up, which is relatively slow and uses a
large amount of area. Proposals for the use of RNS-Q, RNS-QD, and pipelined MRC
have been presented and analyzed in this paper. These proposals offer savings in area
required and may offer speed advantages.

The PLA implementation of RNS-QD offers a significant savings in terms of silicon
area over the straightforward ROM look-up method. The larger moduli concept
did not show an overall decrease in the size of the PLA required for implementation,
but did yield a lower number of output lines, thereby reducing routing requirements
for the circuit. The power-of-two investigation showed that there are some savings to
be gained from implementing these systems, but the savings were not as dramatic as
those previously discovered for the power-of-two adders and multipliers. There is
still the necessity for a multiplier and an adder to facilitate full conversion to a
conventional weighted number. If the need for comparisons occurs much more frequently
than full conversions, the PLA RNS-QD is a more viable method. However, if full
conversions are required in conjunction with virtually every comparison, then this
method may not yield a significant speed advantage over the ROM approach.

RNS-Q systems offer distinct advantages over both the traditional ROM and
RNS-QD methods in terms of silicon real estate. This is especially true if the number
of comparisons that a given system requires is very high compared to all other
mathematical operations. The drawbacks to this approach are that the system is no
longer carry-free and, because of this, there is some loss in the inherent fault tolerance
of conventional RNS. Loss of fault tolerance is most severe if damage were to occur to
the most significant (largest) modulus of the system, which would essentially disable
the entire circuit.

The choice of the method to be employed must be approached from the view of
what type of system is going to be using it, in order to derive the best advantage.
Each of the methods discussed has its own strong points. Considerations as to what is
more crucial to the system, such as area occupied or speed of operation, will help to
choose the implementation that will best fit the needs of the design.

APPENDIX A: C CODE UTILIZED



Enclosed in this appendix are the two C programs used to generate logic equa- 
tions utilized for PLA generation. 

1. C Code for RNS to Binary Converter 



/************************************************************
 * PROGRAM:  cnvrtres.c                                     *
 * FUNCTION: Generates logic equations in the format        *
 *           necessary to be used by the VLSI function      *
 *           "espresso". The output derived can be used by  *
 *           "mpla" to create a PLA layout for a RNS to     *
 *           conventional binary value converter.           *
 * AUTHOR:   David E. Gilbert                               *
 * VERSION:  1.2                                            *
 * DATE (last mod): 29 AUG 1991                             *
 ************************************************************/

#include <stdio.h>
#include <stdlib.h>

/* Count bits by repeated halving of n until it reaches zero.
   This is the width needed to hold n itself (e.g. 2 -> 2 bits). */
static int count_bits(int n)
{
    int bits = 0;

    do {
        n = n / 2;
        bits++;
    } while (n != 0);
    return bits;
}

/* Print value as an nbits-wide binary field, most significant bit first. */
static void print_field(int value, int nbits)
{
    int b;

    for (b = nbits - 1; b >= 0; b--)
        printf("%d", (value >> b) & 1);
}

int main(int argc, char **argv)
{
    int mi[5], ctr[5], M, i, k;

    /* Exactly three moduli are expected on the command line. */
    if (argc != 4) {
        fprintf(stderr, "usage: %s m1 m2 m3\n", argv[0]);
        return 1;
    }

    M = 1;
    for (i = 1; i < argc; i++) {
        mi[i] = atoi(argv[i]);
        if (mi[i] < 1) {
            fprintf(stderr, "moduli must be positive integers\n");
            return 1;
        }
        M = M * mi[i];
    }
    mi[4] = M;    /* the fourth field is the converted value itself */

    /* CALCULATE THE POWERS OF 2 NEEDED FOR NUMBERS OF BITS NEEDED */
    for (i = 1; i <= 4; i++)
        ctr[i] = count_bits(mi[i]);

    printf("# conversion table look-up for RNS with { %d, %d, %d }\n",
           mi[1], mi[2], mi[3]);
    printf(".i %d\n", ctr[1] + ctr[2] + ctr[3]);
    printf(".o %d\n", ctr[4]);
    printf(".phase ");
    for (i = 1; i <= ctr[4]; i++)
        printf("1");
    printf("\n\n");
    /* COMPLETION OF BIT CALCULATION PORTION OF PROGRAM */

    /* Calculate the bit fields required and output */
    for (i = 0; i < M; i++) {
        for (k = 1; k <= 4; k++) {
            /* for k == 4, i % M is simply i, the binary output value */
            print_field(i % mi[k], ctr[k]);
            printf(" ");
        }
        printf("\n");
    }
    printf(".e\n");
    return 0;
}

2. C Code for Quotient Table Generation

/************************************************************
 * PROGRAM NAME: qlugen.c                                   *
 * FUNCTION: Generates minterm equations in the format      *
 *           required by the VLSI tool "espresso" for       *
 *           minimization and then implementation as        *
 *           PLA for an RNS-QD quotient look-up table.      *
 * VERSION:  1.4                                            *
 * AUTHOR:   David E. Gilbert                               *
 * Date Last Changed: 10 Sep 1991                           *
 ************************************************************/

#include <stdio.h>
#include <stdlib.h>

/* Count bits by repeated halving of n until it reaches zero.
   This is the width needed to hold n itself (e.g. 2 -> 2 bits). */
static int count_bits(int n)
{
    int bits = 0;

    do {
        n = n / 2;
        bits++;
    } while (n != 0);
    return bits;
}

/* Print value as an nbits-wide binary field, most significant bit first. */
static void print_field(int value, int nbits)
{
    int b;

    for (b = nbits - 1; b >= 0; b--)
        printf("%d", (value >> b) & 1);
}

int main(int argc, char **argv)
{
    int mi[5], ctr[5], M, i, k, biggest;

    /* Exactly three moduli are expected on the command line. */
    if (argc != 4) {
        fprintf(stderr, "usage: %s m1 m2 m3\n", argv[0]);
        return 1;
    }

    M = 1;
    for (i = 1; i < argc; i++) {
        mi[i] = atoi(argv[i]);
        if (mi[i] < 1) {
            fprintf(stderr, "moduli must be positive integers\n");
            return 1;
        }
        M = M * mi[i];
    }

    /* Find the largest modulus */
    biggest = 0;
    for (i = 1; i < argc; i++) {
        if (mi[i] > biggest)
            biggest = mi[i];
    }
    mi[4] = M / biggest;    /* number of distinct quotients */

    /* CALCULATE THE POWERS OF 2 NEEDED FOR NUMBERS OF BITS NEEDED */
    for (i = 1; i <= 4; i++)
        ctr[i] = count_bits(mi[i]);

    printf("# conversion table look-up for RNS with { %d, %d, %d }\n",
           mi[1], mi[2], mi[3]);
    printf(".i %d\n", ctr[1] + ctr[2] + ctr[3]);
    printf(".o %d\n", ctr[4]);
    printf(".phase ");
    for (i = 1; i <= ctr[4]; i++)
        printf("1");
    printf("\n\n");
    /* COMPLETION OF BIT CALCULATION PORTION OF PROGRAM */

    /* Calculate the bit fields required and output */
    for (i = 0; i < M; i++) {
        /* input fields: the residue of i for each modulus */
        for (k = 1; k <= 3; k++) {
            print_field(i % mi[k], ctr[k]);
            printf(" ");
        }
        /* output field: the quotient of i by the largest modulus */
        print_field(i / biggest, ctr[4]);
        printf("\n");
    }
    printf(".e\n");
    return 0;
}

APPENDIX B: SAMPLE EQUATIONS

1. Output from QLUGEN.C

# conversion table look-up for RNS with {2, 3, 5 }
.i 7
.o 3
.phase 111

00 00 000 000
01 01 001 000
00 10 010 000
01 00 011 000
00 01 100 000
01 10 000 001
00 00 001 001
01 01 010 001
00 10 011 001
01 00 100 001
00 01 000 010
01 10 001 010
00 00 010 010
01 01 011 010
00 10 100 010
01 00 000 011
00 01 001 011
01 10 010 011
00 00 011 011
01 01 100 011
00 10 000 100
01 00 001 100
00 01 010 100
01 10 011 100
00 00 100 100
01 01 000 101
00 10 001 101
01 00 010 101
00 01 011 101
01 10 100 101
.e

2. Reduced Equations from Espresso

The following is a listing of the reduced set of equations from the original
moduli set {2, 3, 5} after being processed by espresso. These are the equations that
would be used to generate the quotient look-up table PLA with the VLSI tool mpla.



# conversion table look-up for RNS with { 2
.i 7
.o 3
#.phase 111
.p 20
0010100 010
0110100 100
0101011 010
0000100 100
0110011 100
0101100 011
0110001 010
0101000 100
000101- 100
0100010 100
0110010 011
0100001 100
000100- 010
001000- 100
0100000 010
000001- 010
000-0-1 001
00-00-1 001
01-0-00 001
010-0-0 001
.e

APPENDIX C: SAMPLE RNL FILES

1. Sample RNL Execution File

The following is a listing of an execution file to simulate the RNS-QD
system {2, 3, 5}.

; The name of this control file for rnl is: qlu2.l
; This is the control file for simulation on a PLA for quotient look-up

; LOAD STANDARD LIBRARY ROUTINES
(load "uwstd.l")
(load "uwsim.l")

; FILE WHICH WILL LOG THE RESULTS
(log-file "qlu2.rlog")

; READ IN THE BINARY NETWORK FILE
(read-network "qlu2")
(sim-init)

; DEFINE THE TIME SCALE FOR SIMULATION
(setq incr 10)

; DEFINE INPUT VECTOR IF ANY, standard STYLE
(defvec '(bit status input_1 input_2 input_3 input_4 input_5 input_6 input_7))
(defvec '(bit output output_1 output_2 output_3))

; DEFINE INPUT VECTOR IF ANY, SINGLE INDEX STYLE

; DEFINE INPUT VECTOR IF ANY, double index STYLE

; STANDARD REPORT FORMAT DEFINITION.
(def-report '("response = " clka clkabar (vec output) (vec status)))

; PLOTFILE SPECIFIED
openplot "qlu2.beh"

; LOGIC ANALYZER STYLE OUTPUT FORMAT SELECTION.
(setq lanalyze t)
(wr-format)

; GLITCH DETECTOR SELECTION.

; NODE TRANSIENTS REPORT DEFINITION.
(chflag '(output_1 output_2 output_3))

; TRIGGER CONDITION SET-UP

; ADDITIONAL SIMULATION SET-UP COMMAND LINES.
(printf "The simulation starts now...\n")

; SPECIFICATION OF A TIME/BASENAME FILE FOR INCLUSION.
(load "qlu2.time")

; ADDITIONAL WRAP-UP COMMAND LINES.
(printf "simulation completed ... check file *.rlog for results. \n")
exit

; GEN-CONTROL COMPLETED.

2. Sample RNL Simulation Output 

; 62 nodes, transistors: enh=157 intrinsic=0 p-chan=30 dep=0 low-power=0 pullup=0 resistor=0

; Report format of logic analyzer style output
time clka clkabar output status

The simulation starts now...

output_1 = 1 @ 0
output_2 = 1 @ 0
output_3 = 1 @ 0
1 0 0 111 0000000
output_1 = 0 @ 0.5
2 1 0 011 0000000
output_3 = 0 @ 0.2
output_2 = 0 @ 0.2
output_2 = 1 @ 0.9
output_3 = 1 @ 0.9
3 0 0 011 0010101
output_1 = 1 @ 0.2
4 1 0 111 0010101
output_3 = 0 @ 0.4
output_2 = 0 @ 0.4
output_1 = 0 @ 0.7
5 0 0 000 0001001
output_3 = 1 @ 0.8
output_2 = 1 @ 0.9
6 1 0 011 0001001
output_2 = 0 @ 0.6
7 0 0 001 0000100
8 1 0 001 0000100
output_3 = 0 @ 0.3
output_2 = 1 @ 0.4
9 0 0 010 0001000
output_2 = 0 @ 0.5
output_1 = 1 @ 0.9
10 1 0 100 0001000
output_2 = 1 @ 0.8
11 0 0 110 1000100
output_1 = 0 @ 0.6
output_2 = 0 @ 0.6
output_2 = 1 @ 0.9
output_1 = 1 @ 0.9
12 1 0 110 1000100
output_1 = 0 @ 0.7
output_2 = 0 @ 0.9
13 0 0 000 0000101
output_3 = 1 @ 0.9
output_1 = 1 @ 0.9
14 1 0 101 0000101
15 0 0 101 0101000
output_1 = 0 @ 0.5
16 1 0 001 0101000
output_3 = 0 @ 0.3
output_3 = 1 @ 0.9
17 0 0 001 0010000
18 1 0 001 0010000
output_1 = 1 @ 0.9
19 0 0 101 0011001
output_2 = 1 @ 0.2
20 1 0 111 0011001
output_3 = 0 @ 0.4
output_1 = 0 @ 0.6
21 0 0 010 0010100
output_2 = 0 @ 0.6
22 1 0 000 0010100
output_3 = 1 @ 0.4
output_2 = 1 @ 0.4
23 0 0 011 0000001
output_2 = 0 @ 0.5
output_3 = 0 @ 0.6
output_3 = 1 @ 0.8
output_2 = 1 @ 0.9
24 1 0 011 0000001
output_1 = 1 @ 0.9
25 0 0 111 0010001
26 1 0 111 0010001
output_3 = 0 @ 0.4
output_2 = 0 @ 0.4
output_2 = 1 @ 0.9
27 0 0 110 0011000
28 1 0 110 0011000
output_1 = 0 @ 0.3
output_2 = 0 @ 0.3
output_3 = 1 @ 0.4
output_1 = 1 @ 0.4
29 0 0 101 0100101
output_1 = 0 @ 0.6
output_3 = 0 @ 0.6
output_3 = 1 @ 0.8
output_2 = 1 @ 0.9
output_1 = 1 @ 0.9
30 1 0 111 0100101
output_2 = 0 @ 0.5
output_1 = 0 @ 0.7
31 0 0 001 1001001
output_2 = 1 @ 0.9
output_1 = 1 @ 0.9
32 1 0 111 1001001
output_2 = 0 @ 0.6
33 0 0 101 1001001

simulation completed ... check file *.rlog for results.